Author: Hang He
GitHub Page: https://GavinHHE.github.io.
Class: CMPS 3160
Dataset Source:
From CDC:
PLACES: Local Data for Better Health, Census Tract Data 2020 release:https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh
From Kaggle:
Cardio Vascular Disease Detection: https://www.kaggle.com/bhadaneeraj/cardio-vascular-disease-detection
Diabetes Health Indicators Dataset: https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset
Code Reference: https://plotly.com/python/choropleth-maps/
Introduction
ETL and EDA
Model Construction and Evaluation
Conclusion
From heart.org, an article states that nearly half of American adults have high blood pressure. Most of the time, high blood pressure (HBP, or hypertension) has no obvious symptoms to indicate that something is wrong. Articles from the CDC also state that only about 1 in 4 adults (24%) with hypertension have their condition under control. It develops slowly over time and can be related to many causes. According to the CDC, heart disease (cardiovascular disease), cancer, and diabetes are currently among the most expensive health conditions in the United States. I believe it is meaningful to study whether high blood pressure is positively related to two diseases: cardiovascular disease and diabetes.
For this project, I will dive deep into the health data from the CDC and the cardiovascular disease data from Kaggle through visualization and analysis. The final report will include a visualization of the percentage of the population with HBP by state, along with an analysis of the importance of HBP as a risk factor for cardiovascular disease and diabetes. I will also build machine learning models to make predictions using the available data. Hopefully, the models will be able to accurately predict whether a specific person has cardiovascular disease or diabetes.
The main question for my project is: "How risky is HBP? Is high blood pressure positively related to either cardiovascular disease or diabetes?" In addition, I would also like to analyze other important risk factors related to both diseases.
All data can also be found in the repository.
Census Tract Data 2020 release (2017 to 2018) contains the aggregated responses to surveys conducted by multiple organizations. Columns in the dataset include when and where the survey was conducted, the total population involved, a description of the question asked, and the response value as a percentage. I will use this dataset to visualize the HBP rate and perform some basic calculations.
Cardio Vascular Disease Detection contains data on people both with and without cardiovascular disease. It includes personal information such as age, gender, height, weight, and blood pressure measurements. The dataset also has columns indicating smoking, drinking, and exercise status; I will assess those three risk factors in the machine learning part.
Diabetes Health Indicators Dataset contains data on people both with and without diabetes. Columns include whether the person smokes or drinks, age group, education level, income level, gender, etc. There are no missing values in the dataset, and many of the columns are categorical or boolean variables.
None of the three datasets has missing values. Data from the Census Tract Data 2020 release are clean and do not need further cleaning. Cardio Vascular Disease Detection contains many unrealistic extreme values, which I will handle by dropping or transforming. Most outliers in the Diabetes Health Indicators Dataset are removed in the EDA part; the remaining ones will be transformed or dropped before making predictions.
Here I will load the Census Tract Data 2020 release. A description of the columns is available on the website: https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh. Since I am only interested in the survey results about blood pressure, I will slice the data and keep the columns: ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
# ETL Process
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('mode.chained_assignment', None)
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
Health_df= pd.read_csv('C:/Users/hehan/Desktop/cmps3160/proj/finalProj/GavinHHE.github.io/PLACES_Data_for_Better_Health_Data_2020.csv',low_memory=False)
## take rows that are related to HBP only
BloodPressure_df = Health_df[Health_df.Short_Question_Text=='High Blood Pressure'].copy()
BloodPressure_df.head()
| Year | StateAbbr | StateDesc | CountyName | CountyFIPS | LocationName | DataSource | Category | Measure | Data_Value_Unit | ... | Data_Value_Footnote | Low_Confidence_Limit | High_Confidence_Limit | TotalPopulation | Geolocation | LocationID | CategoryID | MeasureId | DataValueTypeID | Short_Question_Text | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3317 | 2017 | AL | Alabama | Crenshaw | 1041 | 1041963600 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 44.6 | 45.9 | 3180 | POINT (-86.36923855 31.72464193) | 1041963600 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3329 | 2017 | AL | Alabama | Lauderdale | 1077 | 1077011000 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 41.2 | 43.3 | 4612 | POINT (-87.6820897 34.82994416) | 1077011000 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3338 | 2017 | AL | Alabama | Franklin | 1059 | 1059972900 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 40.3 | 42.4 | 4008 | POINT (-87.61995937 34.52317217) | 1059972900 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3340 | 2017 | AL | Alabama | Jefferson | 1073 | 1073010803 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 36.0 | 37.8 | 6514 | POINT (-86.71445129 33.51402095) | 1073010803 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
| 3363 | 2017 | AL | Alabama | Jefferson | 1073 | 1073010802 | BRFSS | Health Outcomes | High blood pressure among adults aged >=18 years | % | ... | NaN | 31.7 | 33.7 | 3448 | POINT (-86.76308889 33.48895376) | 1073010802 | HLTHOUT | BPHIGH | CrdPrv | High Blood Pressure |
5 rows × 23 columns
## remove columns that are not needed
col = ['Year','StateAbbr','StateDesc','CountyName','Measure','Data_Value_Unit','Data_Value','Geolocation']
BloodPressure_df=BloodPressure_df[col]
BloodPressure_df.reset_index(drop=True,inplace=True)
BloodPressure_df.head()
| Year | StateAbbr | StateDesc | CountyName | Measure | Data_Value_Unit | Data_Value | Geolocation | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | AL | Alabama | Crenshaw | High blood pressure among adults aged >=18 years | % | 45.3 | POINT (-86.36923855 31.72464193) |
| 1 | 2017 | AL | Alabama | Lauderdale | High blood pressure among adults aged >=18 years | % | 42.3 | POINT (-87.6820897 34.82994416) |
| 2 | 2017 | AL | Alabama | Franklin | High blood pressure among adults aged >=18 years | % | 41.4 | POINT (-87.61995937 34.52317217) |
| 3 | 2017 | AL | Alabama | Jefferson | High blood pressure among adults aged >=18 years | % | 36.9 | POINT (-86.71445129 33.51402095) |
| 4 | 2017 | AL | Alabama | Jefferson | High blood pressure among adults aged >=18 years | % | 32.6 | POINT (-86.76308889 33.48895376) |
## Checking for missing values
BloodPressure_df.isna().value_counts()
Year StateAbbr StateDesc CountyName Measure Data_Value_Unit Data_Value Geolocation False False False False False False False False 72337 dtype: int64
## Survey results are recorded by county
## I want to know the mean HBP rate by states
Percent_mean_state = BloodPressure_df.groupby('StateAbbr').Data_Value.mean()
## Top 5 states with highest HBP rate
Percent_mean_state.sort_values(ascending=False)[:5]
StateAbbr
WV    42.212190
AL    41.759660
MS    41.332219
LA    40.080249
KY    39.833544
Name: Data_Value, dtype: float64
## Top 5 states with lowest HBP rate
Percent_mean_state.sort_values(ascending=True)[:5]
StateAbbr
UT    23.843590
CO    25.433736
MN    26.560495
CA    27.377464
MA    28.436500
Name: Data_Value, dtype: float64
## Transform the pd.Series into a DataFrame for further analysis
Percentage_HBP_df = Percent_mean_state.reset_index()
Percentage_HBP_df.columns = ['State', 'Mean_Percentage_HBP']
Percentage_HBP_df.head()
| State | Mean_Percentage_HBP | |
|---|---|---|
| 0 | AK | 30.044910 |
| 1 | AL | 41.759660 |
| 2 | AR | 39.295175 |
| 3 | AZ | 29.618470 |
| 4 | CA | 27.377464 |
## Plot the average HBP rate by state
## reference: https://plotly.com/python/choropleth-maps/
import plotly.graph_objects as go
fig = go.Figure(data=go.Choropleth(
locations=Percentage_HBP_df['State'], # Spatial coordinates
z = Percentage_HBP_df['Mean_Percentage_HBP'].astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
colorscale = 'Reds',
colorbar_title = "Mean HBP rate",
))
fig.update_layout(
title_text = '2017 US High Blood Pressure rates by State',
geo_scope='usa', # limit map scope to USA
)
fig.show()
| Feature | Type | Column | Values |
|---|---|---|---|
| Age | Objective Feature | age | int (days) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code |
| Systolic blood pressure | Examination Feature | ap_hi | int |
| Diastolic blood pressure | Examination Feature | ap_lo | int |
| Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
| Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
| Smoking | Subjective Feature | smoke | binary |
| Alcohol intake | Subjective Feature | alco | binary |
| Physical activity | Subjective Feature | active | binary |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
Disease_df = pd.read_csv('C:/Users/hehan/Desktop/cmps3160/proj/finalProj/GavinHHE.github.io/cardio_disease.csv', sep=';')
Disease_df
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18393 | 2 | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 20228 | 1 | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 18857 | 1 | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 17623 | 2 | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 17474 | 1 | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 69995 | 99993 | 19240 | 2 | 168 | 76.0 | 120 | 80 | 1 | 1 | 1 | 0 | 1 | 0 |
| 69996 | 99995 | 22601 | 1 | 158 | 126.0 | 140 | 90 | 2 | 2 | 0 | 0 | 1 | 1 |
| 69997 | 99996 | 19066 | 2 | 183 | 105.0 | 180 | 90 | 3 | 1 | 0 | 1 | 0 | 1 |
| 69998 | 99998 | 22431 | 1 | 163 | 72.0 | 135 | 80 | 1 | 2 | 0 | 0 | 0 | 1 |
| 69999 | 99999 | 20540 | 1 | 170 | 72.0 | 120 | 80 | 2 | 1 | 0 | 0 | 1 | 0 |
70000 rows × 13 columns
## Checking null values for each column
Disease_df.isna().value_counts()
id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active cardio False False False False False False False False False False False False False 70000 dtype: int64
### I will also check for abnormal values by looking at the minimum and maximum of each column
for var in Disease_df.columns.values:
print('In column '+str(var)+
' Max values :'+str(Disease_df[var].max())+
' Min values :'+str(Disease_df[var].min()))
print('------------------')
In column id Max values :99999 Min values :0
------------------
In column age Max values :23713 Min values :10798
------------------
In column gender Max values :2 Min values :1
------------------
In column height Max values :250 Min values :55
------------------
In column weight Max values :200.0 Min values :10.0
------------------
In column ap_hi Max values :16020 Min values :-150
------------------
In column ap_lo Max values :11000 Min values :-70
------------------
In column cholesterol Max values :3 Min values :1
------------------
In column gluc Max values :3 Min values :1
------------------
In column smoke Max values :1 Min values :0
------------------
In column alco Max values :1 Min values :0
------------------
In column active Max values :1 Min values :0
------------------
In column cardio Max values :1 Min values :0
------------------
### Since a negative value for either ap_hi or ap_lo is not realistic, I will convert those negative values to positive
Disease_df.ap_hi = Disease_df.ap_hi.abs()
Disease_df.ap_lo = Disease_df.ap_lo.abs()
There is no explanation of the meaning of 1 and 2 in the gender column. After comparing the mean weight and height of the two groups, I concluded that gender value 2 is male and gender value 1 is female.
Disease_df.groupby('gender').height.mean()
gender
1    161.355612
2    169.947895
Name: height, dtype: float64
Disease_df.groupby('gender').weight.mean()
gender
1    72.565605
2    77.257307
Name: weight, dtype: float64
## Gender value 2 is male, gender value 1 is female
Disease_df.gender = Disease_df.gender.apply(lambda x: 'Female' if x==1 else 'Male')
According to the documentation, age is stored in days.
### Age is recorded in days, I will convert those values into years
Disease_df.age = Disease_df.age.apply(lambda x: round(x/365,1))
Looking at the box plots of height, weight, ap_hi, and ap_lo, there are many extreme values, some of which are unrealistic. I will remove the unrealistic values. Since the other outliers could be meaningful, I will transform them later.
Disease_df[['height','weight']].boxplot()
plt.show()
According to https://en.wikipedia.org/wiki/List_of_the_verified_shortest_people, the shortest verified person was 54.6 cm tall, so I will remove rows with height lower than 55 cm.
Disease_df = Disease_df[Disease_df.height>=55]
Disease_df[['ap_hi','ap_lo']].boxplot()
plt.show()
I observed many outliers in both ap_hi and ap_lo. According to https://pubmed.ncbi.nlm.nih.gov/7741618/, the highest blood pressure ever recorded is 370/360, so I will remove rows with ap_hi or ap_lo higher than 360.
Disease_df = Disease_df[(Disease_df.ap_hi<=360)& (Disease_df.ap_lo<=360)]
It is unrealistic for a living person to have a diastolic pressure equal to or greater than the systolic pressure, or to have diastolic and systolic pressures below 50.
Disease_df = Disease_df[(Disease_df.ap_hi>=50) & (Disease_df.ap_lo>=50)]
Disease_df = Disease_df[Disease_df.ap_lo<Disease_df.ap_hi]
Disease_df.shape
(68659, 13)
## reset index
Disease_df.reset_index(drop=True,inplace=True)
Disease_df.head()
| id | age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50.4 | Male | 168 | 62.0 | 110 | 80 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 55.4 | Female | 156 | 85.0 | 140 | 90 | 3 | 1 | 0 | 0 | 1 | 1 |
| 2 | 2 | 51.7 | Female | 165 | 64.0 | 130 | 70 | 3 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 48.3 | Male | 169 | 82.0 | 150 | 100 | 1 | 1 | 0 | 0 | 1 | 1 |
| 4 | 4 | 47.9 | Female | 156 | 56.0 | 100 | 60 | 1 | 1 | 0 | 0 | 0 | 0 |
According to the documentation, gender, cholesterol, gluc, smoke, alco, active, and cardio are categorical variables.
## Check data type.
Disease_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 68659 entries, 0 to 68658 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 68659 non-null int64 1 age 68659 non-null float64 2 gender 68659 non-null object 3 height 68659 non-null int64 4 weight 68659 non-null float64 5 ap_hi 68659 non-null int64 6 ap_lo 68659 non-null int64 7 cholesterol 68659 non-null int64 8 gluc 68659 non-null int64 9 smoke 68659 non-null int64 10 alco 68659 non-null int64 11 active 68659 non-null int64 12 cardio 68659 non-null int64 dtypes: float64(2), int64(10), object(1) memory usage: 6.8+ MB
Values of 1, 2, and 3 are hard to interpret for the cholesterol and gluc columns, so I mapped both columns according to the data specification provided.
Disease_df["cholesterol"]=Disease_df["cholesterol"].map({
1: "normal",
2: "above normal",
3: "well above normal",
})
Disease_df["gluc"]=Disease_df["gluc"].map({
1: "normal",
2: "above normal",
3: "well above normal",
})
Disease_dfCleaned = Disease_df.drop(columns='id').copy()
Disease_dfCleaned
| age | gender | height | weight | ap_hi | ap_lo | cholesterol | gluc | smoke | alco | active | cardio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | Male | 168 | 62.0 | 110 | 80 | normal | normal | 0 | 0 | 1 | 0 |
| 1 | 55.4 | Female | 156 | 85.0 | 140 | 90 | well above normal | normal | 0 | 0 | 1 | 1 |
| 2 | 51.7 | Female | 165 | 64.0 | 130 | 70 | well above normal | normal | 0 | 0 | 0 | 1 |
| 3 | 48.3 | Male | 169 | 82.0 | 150 | 100 | normal | normal | 0 | 0 | 1 | 1 |
| 4 | 47.9 | Female | 156 | 56.0 | 100 | 60 | normal | normal | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68654 | 52.7 | Male | 168 | 76.0 | 120 | 80 | normal | normal | 1 | 0 | 1 | 0 |
| 68655 | 61.9 | Female | 158 | 126.0 | 140 | 90 | above normal | above normal | 0 | 0 | 1 | 1 |
| 68656 | 52.2 | Male | 183 | 105.0 | 180 | 90 | well above normal | normal | 0 | 1 | 0 | 1 |
| 68657 | 61.5 | Female | 163 | 72.0 | 135 | 80 | normal | above normal | 0 | 0 | 0 | 1 |
| 68658 | 56.3 | Female | 170 | 72.0 | 120 | 80 | above normal | normal | 0 | 0 | 1 | 0 |
68659 rows × 12 columns
## transform data type
colList = ['gender','cholesterol','gluc','smoke','alco','active','cardio']
for var in colList:
Disease_dfCleaned[var] = Disease_dfCleaned[var].astype('category')
Disease_dfCleaned.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 68659 entries, 0 to 68658 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 68659 non-null float64 1 gender 68659 non-null category 2 height 68659 non-null int64 3 weight 68659 non-null float64 4 ap_hi 68659 non-null int64 5 ap_lo 68659 non-null int64 6 cholesterol 68659 non-null category 7 gluc 68659 non-null category 8 smoke 68659 non-null category 9 alco 68659 non-null category 10 active 68659 non-null category 11 cardio 68659 non-null category dtypes: category(7), float64(2), int64(3) memory usage: 3.1 MB
figure = plt.figure(figsize=(20,4))
plt.subplot(1,5,1)
ax1 = plt.hist(Disease_dfCleaned['weight'], bins=30)
plt.title('Weight')
plt.subplot(1,5,2)
ax2 = plt.hist(Disease_dfCleaned['height'], bins=30)
plt.title('Height')
plt.subplot(1,5,3)
ax2 = plt.hist(Disease_dfCleaned['ap_lo'], bins=30)
plt.title('ap_lo')
plt.subplot(1,5,4)
ax2 = plt.hist(Disease_dfCleaned['ap_hi'], bins=30)
plt.title('ap_hi')
plt.subplot(1,5,5)
ax2 = plt.hist(Disease_dfCleaned['age'], bins=10)
plt.title('Age')
plt.show()
Looking at the distributions of these 5 columns, weight and height seem to follow a normal distribution. There are some outliers in weight and height, but their number is too small to show on the graph. Age is positively skewed. It is hard to tell the distributions of ap_lo and ap_hi.
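The skewness judgment can be backed with a number. Below is a minimal sketch on toy data; the real call would simply be `Disease_dfCleaned['age'].skew()`.

```python
import pandas as pd

# Toy stand-in for Disease_dfCleaned['age']: a few unusually old values
# pull the right tail out, which is what "positively skewed" means.
age = pd.Series([40, 41, 42, 43, 44, 45, 46, 47, 60, 65])

# A positive skew() confirms the right-skewed shape seen in the histogram.
print(age.skew())
```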
figure = plt.figure(figsize=(15,5))
plt.subplot(1,5,1)
ax1 = Disease_dfCleaned.groupby('cardio').weight.mean().plot.bar(color ='green')
plt.title('Mean Weight')
plt.subplot(1,5,2)
ax2 = Disease_dfCleaned.groupby('cardio').height.mean().plot.bar(color ='green')
plt.title('Mean Height')
plt.subplot(1,5,3)
ax3 = Disease_dfCleaned.groupby('cardio').ap_lo.mean().plot.bar(color ='green')
plt.title('Mean ap_lo')
plt.subplot(1,5,4)
ax4 = Disease_dfCleaned.groupby('cardio').ap_hi.mean().plot.bar(color ='green')
plt.title('Mean ap_hi')
plt.subplot(1,5,5)
ax5 = Disease_dfCleaned.groupby('cardio').age.mean().plot.bar(color ='green')
plt.title('Mean Age')
plt.show()
Here, I compared the mean values for those 5 columns. The graphs show that people with cardiovascular disease are slightly older and have higher blood pressure measurements.
figure = plt.figure(figsize=(25,5))
plt.subplot(1,5,1)
ax1 = Disease_dfCleaned[Disease_dfCleaned.smoke==1].cardio.value_counts().plot.bar(color ='green')
plt.title('Cardiovascular diseases among smoker')
plt.subplot(1,5,2)
ax1 = Disease_dfCleaned[Disease_dfCleaned.alco==1].cardio.value_counts().plot.bar(color ='green')
plt.title('Cardiovascular diseases among alcohol use')
plt.show()
I visualized how many smokers and drinkers do and do not have cardiovascular disease. From the graphs, smoking and drinking appear to have no significant impact on cardiovascular disease.
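Counting cases among smokers alone cannot show impact; a more direct check would compare the disease *rate* between smokers and non-smokers. A hedged sketch on a toy frame (on the real data, cardio is stored as a category, so it would first need `.astype(int)`):

```python
import pandas as pd

# Toy stand-in for Disease_dfCleaned with just the two relevant columns.
df = pd.DataFrame({
    'smoke':  [0, 0, 0, 0, 1, 1, 1, 1],
    'cardio': [0, 1, 0, 1, 1, 1, 0, 1],
})

# Share of cardio==1 within each smoking group: a like-for-like comparison.
rate = df.groupby('smoke')['cardio'].mean()
print(rate)
```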
sns.heatmap(Disease_df.corr(), annot=True)
plt.gcf().set_size_inches(10,8)
I also created a heatmap to show the correlations between variables. The disease indicator, cardio, has a high correlation with ap_hi and ap_lo. Age and weight are also correlated with cardio.
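Instead of reading values off the heatmap, the correlations with the target can be ranked numerically. A small sketch with toy columns standing in for the real ones:

```python
import pandas as pd

# Toy numeric columns standing in for Disease_df's features and target.
df = pd.DataFrame({
    'ap_hi':  [110, 140, 130, 150, 120],
    'age':    [50, 60, 55, 65, 45],
    'cardio': [0, 1, 1, 1, 0],
})

# Take the target's correlation column and sort it, strongest first.
corr_with_target = df.corr()['cardio'].drop('cardio').sort_values(ascending=False)
print(corr_with_target)
```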
diabetes_df= pd.read_csv('C:/Users/hehan/Desktop/cmps3160/proj/finalProj/GavinHHE.github.io/diabetes_5050split_health_BRFSS2015.csv')
diabetes_df
| Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5.0 | 30.0 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70687 | 1.0 | 0.0 | 1.0 | 1.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 4.0 | 1.0 |
| 70688 | 1.0 | 0.0 | 1.0 | 1.0 | 29.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 10.0 | 3.0 | 6.0 |
| 70689 | 1.0 | 1.0 | 1.0 | 1.0 | 25.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 5.0 | 15.0 | 0.0 | 1.0 | 0.0 | 13.0 | 6.0 | 4.0 |
| 70690 | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0 |
| 70691 | 1.0 | 1.0 | 1.0 | 1.0 | 25.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 6.0 | 2.0 |
70692 rows × 22 columns
## Checking null values for each column
diabetes_df.isna().info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 70692 entries, 0 to 70691 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Diabetes_binary 70692 non-null bool 1 HighBP 70692 non-null bool 2 HighChol 70692 non-null bool 3 CholCheck 70692 non-null bool 4 BMI 70692 non-null bool 5 Smoker 70692 non-null bool 6 Stroke 70692 non-null bool 7 HeartDiseaseorAttack 70692 non-null bool 8 PhysActivity 70692 non-null bool 9 Fruits 70692 non-null bool 10 Veggies 70692 non-null bool 11 HvyAlcoholConsump 70692 non-null bool 12 AnyHealthcare 70692 non-null bool 13 NoDocbcCost 70692 non-null bool 14 GenHlth 70692 non-null bool 15 MentHlth 70692 non-null bool 16 PhysHlth 70692 non-null bool 17 DiffWalk 70692 non-null bool 18 Sex 70692 non-null bool 19 Age 70692 non-null bool 20 Education 70692 non-null bool 21 Income 70692 non-null bool dtypes: bool(22) memory usage: 1.5 MB
Convert the data types according to the column descriptions provided by the data uploader.
col = ['MentHlth','PhysHlth','BMI']
for var in col:
diabetes_df[var]=diabetes_df[var].astype('int')
col = ['Age','Education','Income','Sex','GenHlth']
for var in col:
diabetes_df[var]=diabetes_df[var].astype('category')
diabetes_df
| Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5 | 30 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0 | 10 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 3 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 70687 | 1.0 | 0.0 | 1.0 | 1.0 | 37 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0 | 0 | 0.0 | 0.0 | 6.0 | 4.0 | 1.0 |
| 70688 | 1.0 | 0.0 | 1.0 | 1.0 | 29 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 1.0 | 1.0 | 10.0 | 3.0 | 6.0 |
| 70689 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 5.0 | 15 | 0 | 1.0 | 0.0 | 13.0 | 6.0 | 4.0 |
| 70690 | 1.0 | 1.0 | 1.0 | 1.0 | 18 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 4.0 | 0 | 0 | 1.0 | 0.0 | 11.0 | 2.0 | 4.0 |
| 70691 | 1.0 | 1.0 | 1.0 | 1.0 | 25 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 9.0 | 6.0 | 2.0 |
70692 rows × 22 columns
Since most of the columns are either categorical or boolean variables, I will check the distribution of BMI only.
diabetes_df[['BMI']].boxplot()
plt.show()
Since it is rare for people to have a BMI greater than 50, I will drop rows with BMI higher than 50.
diabetes_df=diabetes_df[diabetes_df.BMI<=50]
Only a few outliers remain, so their potential effect should be small.
diabetes_df[['BMI']].boxplot()
<AxesSubplot:>
figure = plt.figure(figsize=(18,5))
plt.subplot(1,5,1)
ax1 = plt.hist(diabetes_df['BMI'], bins=20)
plt.title('BMI')
plt.subplot(1,5,2)
ax2 = diabetes_df.groupby('Diabetes_binary').BMI.mean().plot.bar(color ='green')
plt.title('Mean BMI')
Text(0.5, 1.0, 'Mean BMI')
BMI seems to follow a normal distribution. From the other graph, I observed that people with diabetes tend to have a higher BMI.
figure = plt.figure(figsize=(18,5))
plt.subplot(1,5,1)
ax2 = diabetes_df[diabetes_df.Smoker==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among smoker')
plt.subplot(1,5,2)
ax2 = diabetes_df[diabetes_df.HvyAlcoholConsump==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among HvyAlcoholConsump')
plt.subplot(1,5,3)
ax2 = diabetes_df[diabetes_df.HighBP==1].Diabetes_binary.value_counts().plot.bar(color ='green')
plt.title('diabetes among High BP')
plt.show()
Among people who smoke, there is a higher chance of already having been diagnosed with diabetes. Among people with high blood pressure, that chance is significantly higher.
sns.heatmap(diabetes_df.corr())
plt.gcf().set_size_inches(12,11)
I observed that the correlations between Diabetes_binary and HighBP/BMI are relatively high. There are also high correlations between PhysHlth and DiffWalk and between MentHlth and PhysHlth, so I will remove PhysHlth when applying the classification model.
With a machine learning model, I will be able to determine whether high blood pressure is positively related to cardiovascular disease and diabetes. The risks of other factors can also be measured through the model's coefficients.
For both datasets, I can use logistic regression or random forest to make meaningful predictions and measure the importance of HBP as a risk factor. Both datasets are labeled and the target has only two possible values (0 or 1). I will start with logistic regression since it works well for classification and its coefficients are more interpretable.
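The "coefficients as risk signals" idea can be sketched on synthetic data before touching the real features. This is a hypothetical illustration, not the final model: column 0 is built to drive the target, column 2 is pure noise, and the fitted coefficients should reflect that.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, already-standardized features standing in for the real ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.coef_)  # coefficient 0 should dominate by construction
```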
For Cardio Vascular Disease Detection, I will incorporate BMI as a new feature because I think BMI is a better indicator of a person's overall health than the combination of height and weight. I will also transform BMI and age into categorical variables, since I believe BMI and age categories are more meaningful than raw numbers.
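The continuous-to-categorical step can be done with pd.cut. A small sketch using the BMI bin edges; the labels here are illustrative additions, not values from the dataset:

```python
import pandas as pd

# Bin edges follow the common underweight/normal/overweight/obese cut-offs.
bmi = pd.Series([17.5, 22.0, 27.3, 34.9])
bmi_range = pd.cut(bmi, bins=[0, 18, 25, 30, 600],
                   labels=['underweight', 'normal', 'overweight', 'obese'])
print(list(bmi_range))
```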
Categorical variables in the Cardio Vascular Disease Detection data will be transformed into dummy variables. For both the Cardio Vascular Disease Detection and diabetes data, although no column contains extremely large or small values, I will standardize the data. Many outliers remain in the Cardio Vascular Disease dataset after removing unrealistic extreme values; I will replace the remaining outliers with boundary values calculated using 2.0 x the interquartile range.
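The standardization step can be sketched with scikit-learn's StandardScaler on a toy matrix standing in for numeric columns such as ap_hi and BMI:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Each row is one person; columns are toy ap_hi and BMI values.
X = np.array([[110.0, 22.0], [140.0, 30.0], [170.0, 26.0]])
X_std = StandardScaler().fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1.
print(X_std.mean(axis=0), X_std.std(axis=0))
```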
The target variable for the Cardio Vascular Disease Detection data is cardio, which indicates whether a person has cardiovascular disease. The features will be the rest of the columns, including age_range, gender, etc. ap_lo is highly correlated with ap_hi (0.74 from the heatmap), so I will remove ap_lo in further analysis. For the diabetes dataset, the target variable is Diabetes_binary, which indicates whether a person has diabetes. The features will include the remaining columns except PhysHlth, since it is highly correlated with MentHlth and DiffWalk. I will also remove AnyHealthcare and NoDocbcCost because I do not think either column is related to diabetes.
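The planned feature/target split can be sketched as follows; this is a toy stand-in frame, not the real data, but the pattern (drop the target and the correlated ap_lo column, then hold out a test set) is the same:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned cardio frame.
df = pd.DataFrame({
    'ap_hi':  [110, 140, 130, 150],
    'ap_lo':  [80, 90, 70, 100],
    'BMI':    [22.0, 34.9, 23.5, 28.7],
    'cardio': [0, 1, 1, 1],
})

X = df.drop(columns=['cardio', 'ap_lo'])  # ap_lo dropped: corr 0.74 with ap_hi
y = df['cardio']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)
```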
Both datasets are balanced, so no resampling is needed.
figure = plt.figure(figsize=(18,5))
plt.subplot(1,5,1)
ax2 = diabetes_df.Diabetes_binary.value_counts().plot.bar(color ='blue')
plt.title('diabetes data distribution')
plt.subplot(1,5,2)
ax2 = Disease_dfCleaned.cardio.value_counts().plot.bar(color ='green')
plt.title('cardio data distribution')
Text(0.5, 1.0, 'cardio data distribution')
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from tqdm import tqdm
cardio_df_cleaned = Disease_dfCleaned.copy()
## Add BMI as a new feature
cardio_df_cleaned['BMI'] = cardio_df_cleaned.weight/((cardio_df_cleaned.height/100)**2)
cardio_df_cleaned[['BMI']].boxplot()
<AxesSubplot:>
As mentioned, there are many outliers in the dataset. BMI, which is calculated from height and weight, also contains outliers. If I simply drop them, I might lose important information, so I will tolerate outliers to some extent: I will replace them with boundary values calculated using 2.0 x the interquartile range, which I think is more acceptable. Since height and weight will be replaced by BMI, and ap_lo will be removed because of its high correlation with ap_hi, I will transform BMI and ap_hi only.
## replace the outliers by boundary values calculated using 2.0 x Interquartile Range.
collist = ['BMI', 'ap_hi']
for var in collist:
    IQR = cardio_df_cleaned[var].quantile(0.75) - cardio_df_cleaned[var].quantile(0.25)
    lowest_boundary = cardio_df_cleaned[var].quantile(0.25) - 2*IQR
    highest_boundary = cardio_df_cleaned[var].quantile(0.75) + 2*IQR
    for i in tqdm(range(cardio_df_cleaned.shape[0])):
        if cardio_df_cleaned[var][i] >= highest_boundary:
            cardio_df_cleaned.loc[i, var] = highest_boundary
        elif cardio_df_cleaned[var][i] <= lowest_boundary:
            cardio_df_cleaned.loc[i, var] = lowest_boundary
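As a side note, the row-by-row loop above can be vectorized with `Series.clip`, which performs the same 2.0 x IQR capping in one call. A minimal sketch, where `cap_outliers` and the demo values are illustrative names, not part of the project code:

```python
import pandas as pd

def cap_outliers(df: pd.DataFrame, columns, k: float = 2.0) -> pd.DataFrame:
    """Clip each listed column to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    df = df.copy()
    for var in columns:
        q1, q3 = df[var].quantile(0.25), df[var].quantile(0.75)
        iqr = q3 - q1
        df[var] = df[var].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return df

# Tiny demonstration: Q1=22, Q3=25, IQR=3, so the bounds are [16, 31].
demo = pd.DataFrame({'BMI': [15.0, 22.0, 24.0, 25.0, 80.0]})
print(cap_outliers(demo, ['BMI'])['BMI'].tolist())  # → [16.0, 22.0, 24.0, 25.0, 31.0]
```

This avoids the chained indexing in the loop version and runs in milliseconds even on the full 68,659-row dataset.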
cardio_df_cleaned[['BMI']].boxplot()
cardio_df_cleaned[['ap_hi']].boxplot()
## get dummy variables for categorical variables.
cardio_df_cleaned = pd.get_dummies(cardio_df_cleaned,columns=['gender','cholesterol','gluc'])
cardio_df_cleaned.head()
| | age | height | weight | ap_hi | ap_lo | smoke | alco | active | cardio | BMI | gender_Female | gender_Male | cholesterol_above normal | cholesterol_normal | cholesterol_well above normal | gluc_above normal | gluc_normal | gluc_well above normal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | 168 | 62.0 | 110 | 80 | 0 | 0 | 1 | 0 | 21.967120 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 55.4 | 156 | 85.0 | 140 | 90 | 0 | 0 | 1 | 1 | 34.927679 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 51.7 | 165 | 64.0 | 130 | 70 | 0 | 0 | 0 | 1 | 23.507805 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 48.3 | 169 | 82.0 | 150 | 100 | 0 | 0 | 1 | 1 | 28.710479 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 47.9 | 156 | 56.0 | 100 | 60 | 0 | 0 | 0 | 0 | 23.011177 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
## transform BMI into categorical variables.
cardio_df_cleaned['BMI_range'] = pd.cut(x=cardio_df_cleaned['BMI'], bins=[0, 18, 25, 30, 600])
cardio_df_cleaned['BMI_range'] = cardio_df_cleaned['BMI_range'].astype('str')
cardio_df_cleaned.BMI_range = cardio_df_cleaned.BMI_range.str.replace('(', '', regex=False)
cardio_df_cleaned.BMI_range = cardio_df_cleaned.BMI_range.str.replace(']', '', regex=False)
cardio_df_cleaned.BMI_range = cardio_df_cleaned.BMI_range.str.replace('0, 18', 'Below 18', regex=False)
cardio_df_cleaned.BMI_range = cardio_df_cleaned.BMI_range.str.replace('30, 600', 'Above 30', regex=False)
cardio_df_cleaned
| | age | height | weight | ap_hi | ap_lo | smoke | alco | active | cardio | BMI | gender_Female | gender_Male | cholesterol_above normal | cholesterol_normal | cholesterol_well above normal | gluc_above normal | gluc_normal | gluc_well above normal | BMI_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | 168 | 62.0 | 110 | 80 | 0 | 0 | 1 | 0 | 21.967120 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 18, 25 |
| 1 | 55.4 | 156 | 85.0 | 140 | 90 | 0 | 0 | 1 | 1 | 34.927679 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | Above 30 |
| 2 | 51.7 | 165 | 64.0 | 130 | 70 | 0 | 0 | 0 | 1 | 23.507805 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 18, 25 |
| 3 | 48.3 | 169 | 82.0 | 150 | 100 | 0 | 0 | 1 | 1 | 28.710479 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 25, 30 |
| 4 | 47.9 | 156 | 56.0 | 100 | 60 | 0 | 0 | 0 | 0 | 23.011177 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 18, 25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68654 | 52.7 | 168 | 76.0 | 120 | 80 | 1 | 0 | 1 | 0 | 26.927438 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 25, 30 |
| 68655 | 61.9 | 158 | 126.0 | 140 | 90 | 0 | 0 | 1 | 1 | 42.607897 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | Above 30 |
| 68656 | 52.2 | 183 | 105.0 | 180 | 90 | 0 | 1 | 0 | 1 | 31.353579 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | Above 30 |
| 68657 | 61.5 | 163 | 72.0 | 135 | 80 | 0 | 0 | 0 | 1 | 27.099251 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 25, 30 |
| 68658 | 56.3 | 170 | 72.0 | 120 | 80 | 0 | 0 | 1 | 0 | 24.913495 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 18, 25 |
68659 rows × 19 columns
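The string replacements above can also be avoided entirely by passing `labels=` to `pd.cut`, which assigns the category names directly. A small sketch using the same bin edges as above (the `bmi` series here is illustrative data, not the project dataset):

```python
import pandas as pd

bmi = pd.Series([17.0, 22.0, 27.5, 41.0])
bmi_range = pd.cut(bmi,
                   bins=[0, 18, 25, 30, 600],
                   labels=['Below 18', '18, 25', '25, 30', 'Above 30']).astype(str)
print(bmi_range.tolist())  # → ['Below 18', '18, 25', '25, 30', 'Above 30']
```

This produces the same categories in one step and sidesteps the FutureWarning that `str.replace` raises for single-character patterns.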
## Standardize Ap_hi
cardio_df_cleaned['ap_hi_std']=(cardio_df_cleaned.ap_hi-cardio_df_cleaned.ap_hi.mean())/cardio_df_cleaned.ap_hi.std()
# cardio_df_cleaned['ap_lo_std']=(cardio_df_cleaned.ap_lo-cardio_df_cleaned.ap_lo.mean())/cardio_df_cleaned.ap_lo.std()
## remove features
cardio_df_cleaned.drop(['height', 'weight','ap_hi','ap_lo'], axis=1, inplace=True)
## transform age into categorical variables.
cardio_df_cleaned['age_range'] = pd.cut(x=cardio_df_cleaned['age'], bins=[20, 30, 40, 50, 60, 90])
cardio_df_cleaned['age_range'] = cardio_df_cleaned['age_range'].astype('str')
cardio_df_cleaned.age_range = cardio_df_cleaned.age_range.str.replace('(', '', regex=False)
cardio_df_cleaned.age_range = cardio_df_cleaned.age_range.str.replace(']', '', regex=False)
cardio_df_cleaned
| | age | smoke | alco | active | cardio | BMI | gender_Female | gender_Male | cholesterol_above normal | cholesterol_normal | cholesterol_well above normal | gluc_above normal | gluc_normal | gluc_well above normal | BMI_range | ap_hi_std | age_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | 0 | 0 | 1 | 0 | 21.967120 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 18, 25 | -1.013284 | 50, 60 |
| 1 | 55.4 | 0 | 0 | 1 | 1 | 34.927679 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | Above 30 | 0.818023 | 50, 60 |
| 2 | 51.7 | 0 | 0 | 0 | 1 | 23.507805 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 18, 25 | 0.207588 | 50, 60 |
| 3 | 48.3 | 0 | 0 | 1 | 1 | 28.710479 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 25, 30 | 1.428459 | 40, 50 |
| 4 | 47.9 | 0 | 0 | 0 | 0 | 23.011177 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 18, 25 | -1.623720 | 40, 50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 68654 | 52.7 | 1 | 0 | 1 | 0 | 26.927438 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 25, 30 | -0.402848 | 50, 60 |
| 68655 | 61.9 | 0 | 0 | 1 | 1 | 42.607897 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | Above 30 | 0.818023 | 60, 90 |
| 68656 | 52.2 | 0 | 1 | 0 | 1 | 31.353579 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | Above 30 | 3.259766 | 50, 60 |
| 68657 | 61.5 | 0 | 0 | 0 | 1 | 27.099251 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 25, 30 | 0.512805 | 60, 90 |
| 68658 | 56.3 | 0 | 0 | 1 | 0 | 24.913495 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 18, 25 | -0.402848 | 50, 60 |
68659 rows × 17 columns
cardio_df_cleaned.age_range = cardio_df_cleaned.age_range.str.replace('60, 90', 'Above 60')
cardio_df_cleaned.age_range = cardio_df_cleaned.age_range.str.replace('20, 30', 'Below 30')
## get dummy variables for age_range and BMI_range.
cardio_df_cleaned = pd.get_dummies(cardio_df_cleaned,columns=['age_range'])
cardio_df_cleaned = pd.get_dummies(cardio_df_cleaned,columns=['BMI_range'])
cardio_df_cleaned.head()
| | age | smoke | alco | active | cardio | BMI | gender_Female | gender_Male | cholesterol_above normal | cholesterol_normal | ... | ap_hi_std | age_range_30, 40 | age_range_40, 50 | age_range_50, 60 | age_range_Above 60 | age_range_Below 30 | BMI_range_18, 25 | BMI_range_25, 30 | BMI_range_Above 30 | BMI_range_Below 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50.4 | 0 | 0 | 1 | 0 | 21.967120 | 0 | 1 | 0 | 1 | ... | -1.013284 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 55.4 | 0 | 0 | 1 | 1 | 34.927679 | 1 | 0 | 0 | 0 | ... | 0.818023 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 51.7 | 0 | 0 | 0 | 1 | 23.507805 | 1 | 0 | 0 | 0 | ... | 0.207588 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 48.3 | 0 | 0 | 1 | 1 | 28.710479 | 0 | 1 | 0 | 1 | ... | 1.428459 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 47.9 | 0 | 0 | 0 | 0 | 23.011177 | 1 | 0 | 0 | 1 | ... | -1.623720 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 24 columns
## remove one dummy variable for each categorical variable to avoid multicollinearity
cardio_df_cleaned.drop(['gender_Male', 'cholesterol_normal','gluc_normal','age','BMI','age_range_Below 30','BMI_range_18, 25'], axis=1, inplace=True)
cardio_df_cleaned.head()
| | smoke | alco | active | cardio | gender_Female | cholesterol_above normal | cholesterol_well above normal | gluc_above normal | gluc_well above normal | ap_hi_std | age_range_30, 40 | age_range_40, 50 | age_range_50, 60 | age_range_Above 60 | BMI_range_25, 30 | BMI_range_Above 30 | BMI_range_Below 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | -1.013284 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0.818023 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0.207588 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1.428459 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | -1.623720 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
## After scaling, Ap_hi_std is slightly skewed to the left.
figure = plt.figure(figsize=(20,4))
plt.subplot(1,5,1)
ax1 = plt.hist(cardio_df_cleaned['ap_hi_std'], bins=30)
plt.title('ap_hi_std')
plt.show()
X = cardio_df_cleaned.drop(columns=['cardio'])
y = cardio_df_cleaned['cardio']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, # reserve 30% data for testing
stratify=y, # use stratified sampling
random_state=1)
## Ref: https://stackoverflow.com/questions/22306341/python-sklearn-how-to-calculate-p-values
import statsmodels.api as sm
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: 0.563663
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: cardio No. Observations: 48061
Model: Logit Df Residuals: 48045
Method: MLE Df Model: 15
Date: Fri, 17 Dec 2021 Pseudo R-squ.: 0.1867
Time: 17:34:16 Log-Likelihood: -27090.
converged: True LL-Null: -33311.
Covariance Type: nonrobust LLR p-value: 0.000
=================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------
smoke -0.1484 0.041 -3.600 0.000 -0.229 -0.068
alco -0.2114 0.050 -4.201 0.000 -0.310 -0.113
active -0.2485 0.026 -9.532 0.000 -0.300 -0.197
gender_Female -0.0494 0.023 -2.111 0.035 -0.095 -0.004
cholesterol_above normal 0.3713 0.033 11.421 0.000 0.308 0.435
cholesterol_well above normal 1.1212 0.043 26.338 0.000 1.038 1.205
gluc_above normal 0.0756 0.043 1.755 0.079 -0.009 0.160
gluc_well above normal -0.3425 0.047 -7.273 0.000 -0.435 -0.250
ap_hi_std 0.9997 0.014 72.467 0.000 0.973 1.027
age_range_30, 40 -0.8361 0.076 -11.012 0.000 -0.985 -0.687
age_range_40, 50 -0.3402 0.035 -9.617 0.000 -0.409 -0.271
age_range_50, 60 -0.0010 0.033 -0.029 0.977 -0.066 0.064
age_range_Above 60 0.5107 0.039 13.135 0.000 0.435 0.587
BMI_range_25, 30 0.1911 0.024 7.886 0.000 0.144 0.239
BMI_range_Above 30 0.3608 0.028 13.015 0.000 0.306 0.415
BMI_range_Below 18 -0.1868 0.147 -1.273 0.203 -0.474 0.101
=================================================================================================
Before I run the LogisticRegression model using sklearn, I analyze the significance (p-value < 0.05) of each variable using statsmodels. I observed that gluc_above normal, age_range_50, 60, and BMI_range_Below 18 are not significant.
cardio_df_cleaned.drop(['age_range_50, 60', 'gluc_above normal','BMI_range_Below 18'], axis=1, inplace=True)
X = cardio_df_cleaned.drop(columns=['cardio'])
y = cardio_df_cleaned['cardio']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, # reserve 30% data for testing
stratify=y, # use stratified sampling
random_state=1)
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(penalty='l2', random_state=1)
model_LR.fit(X_train,y_train.values)
y_predict = model_LR.predict(X_test)
accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {(accuracy*100).round(4)}%")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)
# save the results for later comparison
accuracy_LR = accuracy
cm_LR = cm
The accuracy is: 73.04% The confusion matrix is: [[8152 2256] [3297 6893]]
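Accuracy alone can hide asymmetric errors, so it may also be worth deriving precision and recall from the confusion matrix. A small sketch using the counts reported above (rows are actual classes, columns are predicted classes, following sklearn's convention):

```python
import numpy as np

# Confusion matrix from the logistic regression above.
cm = np.array([[8152, 2256],
               [3297, 6893]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are real
recall    = tp / (tp + fn)   # of real positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
# → precision=0.753, recall=0.676, f1=0.713
```

The recall of about 0.68 shows the model misses roughly a third of the actual cardio cases, which matters more than raw accuracy for a screening-style application.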
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",model_LR.coef_.transpose())
coef_table
| | 0 | Coefs |
|---|---|---|
| 0 | smoke | -0.148240 |
| 1 | alco | -0.209658 |
| 2 | active | -0.248575 |
| 3 | gender_Female | -0.049274 |
| 4 | cholesterol_above normal | 0.384858 |
| 5 | cholesterol_well above normal | 1.121758 |
| 6 | gluc_well above normal | -0.346879 |
| 7 | ap_hi_std | 1.000327 |
| 8 | age_range_30, 40 | -0.830525 |
| 9 | age_range_40, 50 | -0.339052 |
| 10 | age_range_Above 60 | 0.511884 |
| 11 | BMI_range_25, 30 | 0.194649 |
| 12 | BMI_range_Above 30 | 0.366645 |
## Ref: https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1
coef_table['Odd'] = np.exp(coef_table.Coefs)
coef_table
| | 0 | Coefs | Odd |
|---|---|---|---|
| 0 | smoke | -0.148240 | 0.862224 |
| 1 | alco | -0.209658 | 0.810862 |
| 2 | active | -0.248575 | 0.779911 |
| 3 | gender_Female | -0.049274 | 0.951921 |
| 4 | cholesterol_above normal | 0.384858 | 1.469406 |
| 5 | cholesterol_well above normal | 1.121758 | 3.070246 |
| 6 | gluc_well above normal | -0.346879 | 0.706891 |
| 7 | ap_hi_std | 1.000327 | 2.719170 |
| 8 | age_range_30, 40 | -0.830525 | 0.435820 |
| 9 | age_range_40, 50 | -0.339052 | 0.712445 |
| 10 | age_range_Above 60 | 0.511884 | 1.668432 |
| 11 | BMI_range_25, 30 | 0.194649 | 1.214884 |
| 12 | BMI_range_Above 30 | 0.366645 | 1.442886 |
coef_table.sort_values(by=['Odd'], ascending=False)
| | 0 | Coefs | Odd |
|---|---|---|---|
| 5 | cholesterol_well above normal | 1.121758 | 3.070246 |
| 7 | ap_hi_std | 1.000327 | 2.719170 |
| 10 | age_range_Above 60 | 0.511884 | 1.668432 |
| 4 | cholesterol_above normal | 0.384858 | 1.469406 |
| 12 | BMI_range_Above 30 | 0.366645 | 1.442886 |
| 11 | BMI_range_25, 30 | 0.194649 | 1.214884 |
| 3 | gender_Female | -0.049274 | 0.951921 |
| 0 | smoke | -0.148240 | 0.862224 |
| 1 | alco | -0.209658 | 0.810862 |
| 2 | active | -0.248575 | 0.779911 |
| 9 | age_range_40, 50 | -0.339052 | 0.712445 |
| 6 | gluc_well above normal | -0.346879 | 0.706891 |
| 8 | age_range_30, 40 | -0.830525 | 0.435820 |
How to interpret: https://towardsdatascience.com/interpreting-coefficients-in-linear-and-logistic-regression-6ddf1295f6f1
As the variable cholesterol_well above normal increases by one unit (from 0 to 1), the odds that this person is in the target class ("1") are over 3x as large as the odds that he/she is not. On the other hand, as age_range_30, 40 increases by one unit (from 0 to 1), the odds that the observation is NOT in the target class are 1/0.44, or about 2.27x, the odds that it IS in the target class.
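To make these odds ratios more concrete, note that an odds ratio multiplies the odds, not the probability. A small sketch, assuming a hypothetical baseline probability of 0.5 and reusing the ~3.07 odds ratio for cholesterol_well above normal from the table above (`shifted_probability` is an illustrative helper, not project code):

```python
def shifted_probability(p0: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability p0."""
    odds = p0 / (1 - p0) * odds_ratio
    return odds / (1 + odds)

# A 50% baseline risk combined with the 3.07 odds ratio:
print(round(shifted_probability(0.5, 3.070246), 3))  # → 0.754
```

So under these assumptions, flipping the feature from 0 to 1 moves the risk from 50% to roughly 75%, not to 150%: the multiplicative effect acts on odds, and the resulting probability stays bounded.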
Next, I will run Logistic Regression on the diabetes dataset.
diabetes_df_cleaned = diabetes_df.copy()
diabetes_df_cleaned.head()
| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5 | 30 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0 | 10 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0 | 3 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |
5 rows × 22 columns
## transform BMI into categorical variables.
diabetes_df_cleaned['BMI_range'] = pd.cut(x=diabetes_df_cleaned['BMI'], bins=[0, 18, 25, 30, 600])
diabetes_df_cleaned['BMI_range'] = diabetes_df_cleaned['BMI_range'].astype('str')
diabetes_df_cleaned.BMI_range = diabetes_df_cleaned.BMI_range.str.replace('(', '', regex=False)
diabetes_df_cleaned.BMI_range = diabetes_df_cleaned.BMI_range.str.replace(']', '', regex=False)
diabetes_df_cleaned.BMI_range = diabetes_df_cleaned.BMI_range.str.replace('0, 18', 'Below 18', regex=False)
diabetes_df_cleaned.BMI_range = diabetes_df_cleaned.BMI_range.str.replace('30, 600', 'Above 30', regex=False)
diabetes_df_cleaned.head()
| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | BMI_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 3.0 | 5 | 30 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 | 25, 30 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 3.0 | 0 | 0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 | 25, 30 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 1.0 | 0 | 10 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 | 25, 30 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 3.0 | 0 | 3 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 | 25, 30 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.0 | 2.0 | 0 | 0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 | 25, 30 |
5 rows × 23 columns
diabetes_df_cleaned = pd.get_dummies(diabetes_df_cleaned,columns=['BMI_range'])
Next, I will remove BMI and BMI_range_18, 25 to avoid multicollinearity. PhysHlth will also be removed because of its high correlation with DiffWalk, according to the heat map in the EDA section.
diabetes_df_cleaned.drop(['BMI','BMI_range_18, 25','PhysHlth'], axis=1, inplace=True)
diabetes_df_cleaned.head()
| | Diabetes_binary | HighBP | HighChol | CholCheck | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ... | GenHlth | MentHlth | DiffWalk | Sex | Age | Education | Income | BMI_range_25, 30 | BMI_range_Above 30 | BMI_range_Below 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 3.0 | 5 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 | 1 | 0 | 0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 3.0 | 0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 | 1 | 0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 | 1 | 0 | 0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 3.0 | 0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 | 1 | 0 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ... | 2.0 | 0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 | 1 | 0 | 0 |
5 rows × 23 columns
X = diabetes_df_cleaned.drop(columns=['Diabetes_binary'])
y = diabetes_df_cleaned['Diabetes_binary']
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, # reserve 30% data for testing
stratify=y, # use stratified sampling
random_state=1)
Again, I will test the significance of each feature before running the Logistic Regression model on the diabetes dataset. Every feature in the diabetes dataset is significant (p-value < 0.05), so I do not need to remove any columns.
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
print(result.summary())
Optimization terminated successfully.
Current function value: 0.530307
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: Diabetes_binary No. Observations: 48838
Model: Logit Df Residuals: 48816
Method: MLE Df Model: 21
Date: Fri, 17 Dec 2021 Pseudo R-squ.: 0.2349
Time: 17:34:33 Log-Likelihood: -25899.
converged: True LL-Null: -33851.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
HighBP 0.7816 0.023 33.347 0.000 0.736 0.828
HighChol 0.6138 0.022 27.395 0.000 0.570 0.658
CholCheck -0.6227 0.060 -10.380 0.000 -0.740 -0.505
Smoker -0.0958 0.022 -4.326 0.000 -0.139 -0.052
Stroke 0.1637 0.049 3.343 0.001 0.068 0.260
HeartDiseaseorAttack 0.3591 0.034 10.563 0.000 0.292 0.426
PhysActivity -0.1502 0.025 -6.034 0.000 -0.199 -0.101
Fruits -0.0770 0.023 -3.342 0.001 -0.122 -0.032
Veggies -0.1771 0.028 -6.412 0.000 -0.231 -0.123
HvyAlcoholConsump -0.7927 0.057 -13.825 0.000 -0.905 -0.680
AnyHealthcare -0.6513 0.053 -12.244 0.000 -0.756 -0.547
NoDocbcCost -0.3283 0.039 -8.389 0.000 -0.405 -0.252
GenHlth 0.3796 0.012 31.986 0.000 0.356 0.403
MentHlth -0.0082 0.001 -5.468 0.000 -0.011 -0.005
DiffWalk 0.1827 0.030 6.142 0.000 0.124 0.241
Sex 0.1988 0.023 8.817 0.000 0.155 0.243
Age 0.0920 0.004 21.333 0.000 0.084 0.100
Education -0.2408 0.011 -21.596 0.000 -0.263 -0.219
Income -0.0862 0.006 -14.136 0.000 -0.098 -0.074
BMI_range_25, 30 0.4023 0.028 14.269 0.000 0.347 0.458
BMI_range_Above 30 1.0279 0.029 35.466 0.000 0.971 1.085
BMI_range_Below 18 -0.5885 0.128 -4.610 0.000 -0.839 -0.338
========================================================================================
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(penalty='l2', max_iter=1000,random_state=1)
model_LR.fit(X_train,y_train.values)
y_predict = model_LR.predict(X_test)
accuracy = accuracy_score(y_test, y_predict).round(4)
print(f"The accuracy is: {(accuracy*100).round(4)}%")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)
# save the results for later comparison
accuracy_LR = accuracy
cm_LR = cm
The accuracy is: 74.87% The confusion matrix is: [[7676 2862] [2399 7994]]
coef_table_diabetes = pd.DataFrame(list(X_train.columns)).copy()
coef_table_diabetes.insert(len(coef_table_diabetes.columns),"Coefs",model_LR.coef_.transpose())
coef_table_diabetes['Odd'] = np.exp(coef_table_diabetes.Coefs)
coef_table_diabetes.sort_values(by=['Odd'], ascending=False)
| | 0 | Coefs | Odd |
|---|---|---|---|
| 2 | CholCheck | 1.313998 | 3.721022 |
| 20 | BMI_range_Above 30 | 1.224580 | 3.402738 |
| 0 | HighBP | 0.726354 | 2.067528 |
| 1 | HighChol | 0.594322 | 1.811801 |
| 19 | BMI_range_25, 30 | 0.573442 | 1.774364 |
| 12 | GenHlth | 0.561240 | 1.752845 |
| 15 | Sex | 0.262881 | 1.300673 |
| 5 | HeartDiseaseorAttack | 0.244840 | 1.277417 |
| 16 | Age | 0.146614 | 1.157907 |
| 4 | Stroke | 0.137768 | 1.147709 |
| 14 | DiffWalk | 0.115979 | 1.122972 |
| 10 | AnyHealthcare | 0.067923 | 1.070283 |
| 3 | Smoker | 0.005178 | 1.005192 |
| 13 | MentHlth | -0.005700 | 0.994316 |
| 6 | PhysActivity | -0.014903 | 0.985208 |
| 11 | NoDocbcCost | -0.022802 | 0.977456 |
| 7 | Fruits | -0.036106 | 0.964538 |
| 17 | Education | -0.043227 | 0.957694 |
| 18 | Income | -0.061062 | 0.940765 |
| 8 | Veggies | -0.069100 | 0.933234 |
| 21 | BMI_range_Below 18 | -0.406354 | 0.666075 |
| 9 | HvyAlcoholConsump | -0.733739 | 0.480111 |
For both models, I think 73.04% and 74.87% accuracy are not strong enough if we want to use these models to predict whether a person has Cardiovascular Disease or Diabetes in real life. However, these models can still help us identify risk factors among the features in the datasets. The results support my first hypothesis that High Blood Pressure is positively correlated with both Cardiovascular Disease and Diabetes. To test this hypothesis, we can focus on the odds in the coefficient tables. It is clear that hypertension is positively related to Cardiovascular Disease: as the variable ap_hi_std increases by one unit (one standard deviation of ap_hi), the odds that a person has Cardiovascular Disease are over 2.7x as large as the odds that he/she does not. For the Diabetes Health Indicators Dataset, as the variable HighBP increases by one unit (from 0 to 1), the odds that a person is in the target class (Diabetes_binary = 1) are over 2.1x as large as the odds that he/she is not.
From the coefficient tables, I do not have supporting evidence for my second hypothesis that hypertension is the number one cause of both Cardiovascular Disease and Diabetes. For Cardiovascular Disease, based on the features in the dataset, cholesterol_well above normal, ap_hi_std, age_range_Above 60, cholesterol_above normal, and BMI_range_Above 30 are the top 5 risk factors; cholesterol level is the most important indicator. For the Diabetes Health Indicators Dataset, based on the model results, CholCheck, BMI_range_Above 30, HighBP, HighChol, and BMI_range_25, 30 are the top 5 risk factors; BMI and CholCheck are more important than HighBP.
In conclusion, although High Blood Pressure is not the number one cause of either Cardiovascular Disease or Diabetes, it is still highly correlated with both. Therefore, avoiding high blood pressure can still help you stay healthy and keep you away from expensive medical expenses in the future. In addition, regular cholesterol checks and a healthy BMI are also important for avoiding Cardiovascular Disease and Diabetes.
Among the smoking- and alcohol-related features, the highest odds I observed were 1.005192, for Smoker in the diabetes regression, which is barely above 1. The models therefore suggest that smoking or drinking does not make people noticeably more susceptible to cardiovascular disease or diabetes in these datasets.
The odds for gender_Female (1 represents female) in the Cardiovascular dataset are 0.951921, and the odds for Sex (1 represents male) in the diabetes dataset are 1.300673; both suggest that males are at slightly higher risk.
Disease analysis is a trending topic nowadays. By incorporating machine learning or AI, hospitals can better serve patients and support doctors in medical diagnosis. Although my project mainly focuses on understanding the correlation between hypertension and Cardiovascular Disease and Diabetes, the machine learning models could be used to support medical diagnosis if I can further improve their predictive accuracy. To achieve that goal, I could first tune the model parameters during construction or incorporate other models; a second way to improve the models is to include more features.
Here is an interesting article discussing machine learning in healthcare: https://healthitanalytics.com/features/how-machine-learning-is-transforming-clinical-decision-support-tools